========================================================
We will analyze the red wine quality dataset using R, applying exploratory data analysis techniques to explore the relationships in the dataset from different angles: one variable, two variables, and multiple variables. We will also look at the distribution of the data and its outliers.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## 7 7 7.9 0.60 0.06 1.6 0.069
## 8 8 7.3 0.65 0.00 1.2 0.065
## 9 9 7.8 0.58 0.02 2.0 0.073
## 10 10 7.5 0.50 0.36 6.1 0.071
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## 7 15 59 0.9964 3.30 0.46 9.4
## 8 15 21 0.9946 3.39 0.47 10.0
## 9 9 18 0.9968 3.36 0.57 9.5
## 10 17 102 0.9978 3.35 0.80 10.5
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
## 7 5
## 8 7
## 9 7
## 10 5
summary(RW$X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 400.5 800.0 800.0 1200.0 1599.0
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The dataset contains 1599 red wine instances and 13 variables. We have added one categorical variable, rate, that groups the quality scores of the red wine. All of the original variables are numerical; the only categorical one is the variable we created above. Most of the red wines are rated average.
The main feature that interests me in the dataset is the quality of the red wine, especially in relation to alcohol. I want to see if there’s any correlation between these two features, along with others.
I will investigate other features such as pH, density, and acidity (citric.acid, fixed.acidity). In addition, residual.sugar and total.sulfur.dioxide might affect how people perceive the taste of the red wine.
Yes, I created a rate variable based on the quality variable, which helps simplify our analysis.
I investigated most of the features. There were unusual distributions in two of the plots above, for citric.acid and alcohol. We will investigate these features further alongside other variables.
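A derived rating like this can be sketched with `cut()`. The report does not state the cut points that were used, so the breaks and the labels poor, average, and excellent below are assumptions chosen only for illustration. The ten quality scores are the ones shown in the data preview.

```r
# First ten quality scores, copied from the data preview in this report.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Hypothetical cut points (not stated in the report):
# 3-4 = poor, 5-6 = average, 7-8 = excellent.
rate <- cut(quality,
            breaks = c(2, 4, 6, 8),
            labels = c("poor", "average", "excellent"),
            ordered_result = TRUE)

# Count how many wines fall into each rating.
table(rate)
```

On the full data frame the same call would read `RW$rate <- cut(RW$quality, ...)`. An ordered factor keeps the ratings in a sensible order on plot axes and legends.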
We investigated the relationships between variables in the red wine dataset and found the following. There appears to be a relationship between the quality of the wine and its alcohol concentration. There is also a positive relationship between citric acid and the quality of the red wine: the higher the citric acid concentration, the better the quality tends to be, although this correlation is modest (about 0.23).
We also checked the relationships between other features, such as pH with density and citric acid, and total sulfur dioxide with free sulfur dioxide. We found some relationships between these variables. In particular, there is a strong relationship between total and free sulfur dioxide, as can be seen in the graphs above.
I think the strongest relationships I found were between total and free sulfur dioxide, and between citric acid and fixed acidity.
cor(RW$free.sulfur.dioxide,RW$total.sulfur.dioxide)
## [1] 0.6676665
cor.test(RW$citric.acid, RW$fixed.acidity)
##
## Pearson's product-moment correlation
##
## data: RW$citric.acid and RW$fixed.acidity
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6438839 0.6977493
## sample estimates:
## cor
## 0.6717034
cor.test(RW$citric.acid, RW$quality)
##
## Pearson's product-moment correlation
##
## data: RW$citric.acid and RW$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
cor.test(RW$fixed.acidity, RW$quality)
##
## Pearson's product-moment correlation
##
## data: RW$fixed.acidity and RW$quality
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
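Rather than calling `cor.test()` once per variable, the correlations with quality can be collected in one pass. The sketch below is only an illustration: `sample_rw` is a hypothetical data frame built from the ten rows shown in the data preview, so its coefficients will not match the full-dataset values reported above.

```r
# Ten rows copied from the data preview; a small stand-in for RW.
sample_rw <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Every column except quality is treated as a predictor.
predictors <- setdiff(names(sample_rw), "quality")

# Pearson correlation of each predictor with quality.
quality_cor <- sapply(sample_rw[predictors], cor, y = sample_rw$quality)

# Show the strongest relationships (by absolute value) first.
quality_cor[order(abs(quality_cor), decreasing = TRUE)]
```

Replacing `sample_rw` with `RW` (and dropping the `X` index column) would rank all chemical variables by how strongly they correlate with quality.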
We looked into different relationships between the variables in the red wine dataset and found a strong relationship between fixed acidity and citric acid; the two are strongly correlated with each other. There is also a relationship between density and pH, along with red wine quality: wines with a low pH can have both high density and excellent quality, but wines with a low pH can also be of poor quality. This suggests that other variables also influence the quality of red wine.
Yes, there are many interesting interactions between total sulfur dioxide and alcohol, even though there are some outliers among the excellent-quality red wines. There are also interesting interactions between citric acid and fixed acidity.
Plot one shows the relationship between fixed acidity and citric acid. The two are correlated, with a coefficient of 0.6717034.
Plot two shows the relationship between density and pH, along with red wine quality. Wines with a low pH can have both high density and excellent quality, but wines with a low pH can also be of poor quality. This suggests that other variables also influence the quality of red wine.
Plot three shows a strong relationship between fixed acidity and citric acid, which are strongly correlated with each other (0.67). I also estimated the correlation of citric acid and of fixed acidity with quality, and found that citric acid is more strongly correlated with quality (0.2263725) than fixed acidity is (0.1240516).
We investigated the red wine dataset, which has 1599 instances and 13 variables, and created one categorical variable to represent the rating of red wine quality. One-variable, two-variable, and multi-variable plots were created throughout the investigation above. We found that many variables are correlated with each other, such as citric acid and fixed acidity. In addition, many other factors may affect the quality of a red wine, such as total sulfur dioxide and alcohol. We ran into some issues calculating the correlation between quality and other variables, so we had to convert quality from an integer to a numeric variable, as we did above, to estimate the correlation between the chemical factors and quality. This project was a great exercise and lesson for me. There is a lot more I want to do, such as a correlation matrix and a heatmap, but that may come in future work and courses.
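As a starting point for that future work, here is a minimal sketch of a correlation matrix. It also shows one way the conversion problem can arise: the report does not say how quality was stored when the error occurred, so the factor case below is an assumption (`cor()` does accept plain integer columns). `sample_rw` is again a hypothetical stand-in built from the ten previewed rows.

```r
# Assumed scenario: quality ended up stored as a factor.
quality_factor <- factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))

# as.numeric() on a factor returns level codes (1, 2, 3), not the scores,
# so convert through character to recover the original values.
quality_num <- as.numeric(as.character(quality_factor))

# Ten rows copied from the data preview; a small stand-in for RW.
sample_rw <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality       = quality_num
)

# Pairwise Pearson correlations, rounded for readability.
cm <- round(cor(sample_rw), 2)
cm
```

Passing the matrix to base R's `heatmap()`, for example `heatmap(cm, symm = TRUE)`, would draw it as a heatmap. With only ten rows the coefficients are illustrative and will differ from those of the full dataset.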